Data transformation

R4DS 03 - Data transformation with dplyr

lruolin
04-28-2021

R4DS Practice 03: Data transformation with dplyr

The codes below are from R4DS book by Hadley Wickham.

Loading required packages:

The nycflights13 dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from US Bureau of Transportation Statistics.

head(flights)
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  2013     1     1      517            515         2      830
2  2013     1     1      533            529         4      850
3  2013     1     1      542            540         2      923
4  2013     1     1      544            545        -1     1004
5  2013     1     1      554            600        -6      812
6  2013     1     1      554            558        -4      740
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Learning filter()

Selecting all flights on Jan 1st:

filter(flights, month == 1, day == 1) # filter month = 1 (Jan) and day = 1 (1st)
# A tibble: 842 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 832 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# to filter for flights on xmas day
flights %>% 
  filter(month == 12, day == 25)
# A tibble: 719 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013    12    25      456            500        -4      649
 2  2013    12    25      524            515         9      805
 3  2013    12    25      542            540         2      832
 4  2013    12    25      546            550        -4     1022
 5  2013    12    25      556            600        -4      730
 6  2013    12    25      557            600        -3      743
 7  2013    12    25      557            600        -3      818
 8  2013    12    25      559            600        -1      855
 9  2013    12    25      559            600        -1      849
10  2013    12    25      600            600         0      850
# … with 709 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# to filter for flights departing in nov and dec
flights %>% 
  filter(month %in% c(11, 12))
# A tibble: 55,403 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013    11     1        5           2359         6      352
 2  2013    11     1       35           2250       105      123
 3  2013    11     1      455            500        -5      641
 4  2013    11     1      539            545        -6      856
 5  2013    11     1      542            545        -3      831
 6  2013    11     1      549            600       -11      912
 7  2013    11     1      550            600       -10      705
 8  2013    11     1      554            600        -6      659
 9  2013    11     1      554            600        -6      826
10  2013    11     1      554            600        -6      749
# … with 55,393 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Exercise questions:

Q1 - Find all flights that had an arrival delay of two or more hours

flights %>% 
  filter(arr_delay >= 120)
# A tibble: 10,200 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      811            630       101     1047
 2  2013     1     1      848           1835       853     1001
 3  2013     1     1      957            733       144     1056
 4  2013     1     1     1114            900       134     1447
 5  2013     1     1     1505           1310       115     1638
 6  2013     1     1     1525           1340       105     1831
 7  2013     1     1     1549           1445        64     1912
 8  2013     1     1     1558           1359       119     1718
 9  2013     1     1     1732           1630        62     2028
10  2013     1     1     1803           1620       103     2008
# … with 10,190 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Q2 - Flew to Houston (IAH or HOU)

flights %>% 
  filter(dest %in% c("IAH", "HOU"))
# A tibble: 9,313 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      623            627        -4      933
 4  2013     1     1      728            732        -4     1041
 5  2013     1     1      739            739         0     1104
 6  2013     1     1      908            908         0     1228
 7  2013     1     1     1028           1026         2     1350
 8  2013     1     1     1044           1045        -1     1352
 9  2013     1     1     1114            900       134     1447
10  2013     1     1     1205           1200         5     1503
# … with 9,303 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Q3 - Were operated by United, American or Delta

flights %>% 
  filter(carrier %in% c("UA", "AA", "DL"))
# A tibble: 139,504 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      554            600        -6      812
 5  2013     1     1      554            558        -4      740
 6  2013     1     1      558            600        -2      753
 7  2013     1     1      558            600        -2      924
 8  2013     1     1      558            600        -2      923
 9  2013     1     1      559            600        -1      941
10  2013     1     1      559            600        -1      854
# … with 139,494 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Q4 - Departed in July, August, and September

flights %>% 
  filter(month %in% c(7:9))
# A tibble: 86,326 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     7     1        1           2029       212      236
 2  2013     7     1        2           2359         3      344
 3  2013     7     1       29           2245       104      151
 4  2013     7     1       43           2130       193      322
 5  2013     7     1       44           2150       174      300
 6  2013     7     1       46           2051       235      304
 7  2013     7     1       48           2001       287      308
 8  2013     7     1       58           2155       183      335
 9  2013     7     1      100           2146       194      327
10  2013     7     1      100           2245       135      337
# … with 86,316 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Q5 - Arrived more than two hours late, but didn’t leave late

flights %>% 
  filter(arr_delay >= 120, dep_delay <=0)
# A tibble: 29 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1    27     1419           1420        -1     1754
 2  2013    10     7     1350           1350         0     1736
 3  2013    10     7     1357           1359        -2     1858
 4  2013    10    16      657            700        -3     1258
 5  2013    11     1      658            700        -2     1329
 6  2013     3    18     1844           1847        -3       39
 7  2013     4    17     1635           1640        -5     2049
 8  2013     4    18      558            600        -2     1149
 9  2013     4    18      655            700        -5     1213
10  2013     5    22     1827           1830        -3     2217
# … with 19 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Q6 - Were delayed by at least an hour, but made up over 30 minutes in flight

flights %>% 
  filter(dep_delay >=60, dep_delay - arr_delay >30)
# A tibble: 1,844 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1     2205           1720       285       46
 2  2013     1     1     2326           2130       116      131
 3  2013     1     3     1503           1221       162     1803
 4  2013     1     3     1839           1700        99     2056
 5  2013     1     3     1850           1745        65     2148
 6  2013     1     3     1941           1759       102     2246
 7  2013     1     3     1950           1845        65     2228
 8  2013     1     3     2015           1915        60     2135
 9  2013     1     3     2257           2000       177       45
10  2013     1     4     1917           1700       137     2135
# … with 1,834 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Q7 - Departed between midnight and 6am

flights %>% 
  filter(dep_time <= 600 | dep_time == 2400)
# A tibble: 9,373 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 9,363 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Using the between function to filter flights that departed in summer:

flights %>% 
  filter(between(month, 7, 9))
# A tibble: 86,326 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     7     1        1           2029       212      236
 2  2013     7     1        2           2359         3      344
 3  2013     7     1       29           2245       104      151
 4  2013     7     1       43           2130       193      322
 5  2013     7     1       44           2150       174      300
 6  2013     7     1       46           2051       235      304
 7  2013     7     1       48           2001       287      308
 8  2013     7     1       58           2155       183      335
 9  2013     7     1      100           2146       194      327
10  2013     7     1      100           2245       135      337
# … with 86,316 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

How many flights have a missing dep_time?

flights %>% 
  filter(is.na(dep_time))
# A tibble: 8,255 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1       NA           1630        NA       NA
 2  2013     1     1       NA           1935        NA       NA
 3  2013     1     1       NA           1500        NA       NA
 4  2013     1     1       NA            600        NA       NA
 5  2013     1     2       NA           1540        NA       NA
 6  2013     1     2       NA           1620        NA       NA
 7  2013     1     2       NA           1355        NA       NA
 8  2013     1     2       NA           1420        NA       NA
 9  2013     1     2       NA           1321        NA       NA
10  2013     1     2       NA           1545        NA       NA
# … with 8,245 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Learning arrange()

flights %>% 
  arrange(year, month, day)
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
flights %>% 
  arrange(desc(month))
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013    12     1       13           2359        14      446
 2  2013    12     1       17           2359        18      443
 3  2013    12     1      453            500        -7      636
 4  2013    12     1      520            515         5      749
 5  2013    12     1      536            540        -4      845
 6  2013    12     1      540            550       -10     1005
 7  2013    12     1      541            545        -4      734
 8  2013    12     1      546            545         1      826
 9  2013    12     1      549            600       -11      648
10  2013    12     1      550            600       -10      825
# … with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Exercises

Using arrange() to sort all the missing values to the start?

flights %>% 
  arrange(desc(is.na(dep_time), dep_time))
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1       NA           1630        NA       NA
 2  2013     1     1       NA           1935        NA       NA
 3  2013     1     1       NA           1500        NA       NA
 4  2013     1     1       NA            600        NA       NA
 5  2013     1     2       NA           1540        NA       NA
 6  2013     1     2       NA           1620        NA       NA
 7  2013     1     2       NA           1355        NA       NA
 8  2013     1     2       NA           1420        NA       NA
 9  2013     1     2       NA           1321        NA       NA
10  2013     1     2       NA           1545        NA       NA
# … with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Sort flights to find the most delayed flights:

flights %>% 
  arrange(desc(dep_delay))
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     9      641            900      1301     1242
 2  2013     6    15     1432           1935      1137     1607
 3  2013     1    10     1121           1635      1126     1239
 4  2013     9    20     1139           1845      1014     1457
 5  2013     7    22      845           1600      1005     1044
 6  2013     4    10     1100           1900       960     1342
 7  2013     3    17     2321            810       911      135
 8  2013     6    27      959           1900       899     1236
 9  2013     7    22     2257            759       898      121
10  2013    12     5      756           1700       896     1058
# … with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Sort flights that left the earliest

flights %>% 
  arrange(dep_delay)
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013    12     7     2040           2123       -43       40
 2  2013     2     3     2022           2055       -33     2240
 3  2013    11    10     1408           1440       -32     1549
 4  2013     1    11     1900           1930       -30     2233
 5  2013     1    29     1703           1730       -27     1947
 6  2013     8     9      729            755       -26     1002
 7  2013    10    23     1907           1932       -25     2143
 8  2013     3    30     2030           2055       -25     2213
 9  2013     3     2     1431           1455       -24     1601
10  2013     5     5      934            958       -24     1225
# … with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Sort flights to find the fastest flights

flights %>% 
  arrange(air_time) %>% 
  head()
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  2013     1    16     1355           1315        40     1442
2  2013     4    13      537            527        10      622
3  2013    12     6      922            851        31     1021
4  2013     2     3     2153           2129        24     2247
5  2013     2     5     1303           1315       -12     1342
6  2013     2    12     2123           2130        -7     2211
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Which flights traveled the shortest?

flights %>% 
  arrange(air_time) %>% 
  head()
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  2013     1    16     1355           1315        40     1442
2  2013     4    13      537            527        10      622
3  2013    12     6      922            851        31     1021
4  2013     2     3     2153           2129        24     2247
5  2013     2     5     1303           1315       -12     1342
6  2013     2    12     2123           2130        -7     2211
# … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

which flights travelled the longest?

flights %>% 
  arrange(desc(air_time))
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     3    17     1337           1335         2     1937
 2  2013     2     6      853            900        -7     1542
 3  2013     3    15     1001           1000         1     1551
 4  2013     3    17     1006           1000         6     1607
 5  2013     3    16     1001           1000         1     1544
 6  2013     2     5      900            900         0     1555
 7  2013    11    12      936            930         6     1630
 8  2013     3    14      958           1000        -2     1542
 9  2013    11    20     1006           1000         6     1639
10  2013     3    15     1342           1335         7     1924
# … with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Learning select()

# Selecting columns by name

flights %>% 
  select(year, month, day)
# A tibble: 336,776 x 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# … with 336,766 more rows
# Select all columns between year and day

flights %>% 
  select(year:day)
# A tibble: 336,776 x 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# … with 336,766 more rows
# Select all columns except those from year to day

flights %>% 
  select(-(year:day))
# A tibble: 336,776 x 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
      <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1      517            515         2      830            819        11
 2      533            529         4      850            830        20
 3      542            540         2      923            850        33
 4      544            545        -1     1004           1022       -18
 5      554            600        -6      812            837       -25
 6      554            558        -4      740            728        12
 7      555            600        -5      913            854        19
 8      557            600        -3      709            723       -14
 9      557            600        -3      838            846        -8
10      558            600        -2      753            745         8
# … with 336,766 more rows, and 10 more variables: carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
# Select() with everything()

flights %>% 
  select(time_hour, air_time, everything())
# A tibble: 336,776 x 19
   time_hour           air_time  year month   day dep_time
   <dttm>                 <dbl> <int> <int> <int>    <int>
 1 2013-01-01 05:00:00      227  2013     1     1      517
 2 2013-01-01 05:00:00      227  2013     1     1      533
 3 2013-01-01 05:00:00      160  2013     1     1      542
 4 2013-01-01 05:00:00      183  2013     1     1      544
 5 2013-01-01 06:00:00      116  2013     1     1      554
 6 2013-01-01 05:00:00      150  2013     1     1      554
 7 2013-01-01 06:00:00      158  2013     1     1      555
 8 2013-01-01 06:00:00       53  2013     1     1      557
 9 2013-01-01 06:00:00      140  2013     1     1      557
10 2013-01-01 06:00:00      138  2013     1     1      558
# … with 336,766 more rows, and 13 more variables:
#   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   distance <dbl>, hour <dbl>, minute <dbl>

Variables may also be defined and called upon using all_of() and any_of()

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

flights %>% 
  select(all_of(vars))
# A tibble: 336,776 x 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# … with 336,766 more rows
flights %>% 
  select(any_of(vars))
# A tibble: 336,776 x 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# … with 336,766 more rows

Selecting variables that contain “time”

flights %>% 
  select(contains("time"))
# A tibble: 336,776 x 6
   dep_time sched_dep_time arr_time sched_arr_time air_time
      <int>          <int>    <int>          <int>    <dbl>
 1      517            515      830            819      227
 2      533            529      850            830      227
 3      542            540      923            850      160
 4      544            545     1004           1022      183
 5      554            600      812            837      116
 6      554            558      740            728      150
 7      555            600      913            854      158
 8      557            600      709            723       53
 9      557            600      838            846      140
10      558            600      753            745      138
# … with 336,766 more rows, and 1 more variable: time_hour <dttm>

Learning mutate()

Questions

  1. Currently dep_time and sched_dep_time are convenient to look at but hard to compute with because they are not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight
flights %>% 
  select(dep_time, sched_dep_time) %>% 
  mutate(dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
         sched_dep_time_mins = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440)
# A tibble: 336,776 x 4
   dep_time sched_dep_time dep_time_mins sched_dep_time_mins
      <int>          <int>         <dbl>               <dbl>
 1      517            515           317                 315
 2      533            529           333                 329
 3      542            540           342                 340
 4      544            545           344                 345
 5      554            600           354                 360
 6      554            558           354                 358
 7      555            600           355                 360
 8      557            600           357                 360
 9      557            600           357                 360
10      558            600           358                 360
# … with 336,766 more rows
  1. Compare air_time with arr_time - dep_time.
flights %>% 
  mutate((dep_time = dep_time %/% 100*60 + dep_time %/% 100) %% 1440,
         arr_time = (arr_time %/% 100*60 + arr_time %% 100) %% 1440,
         air_time_diff = air_time - (arr_time + dep_time)) %>% 
  filter(air_time_diff != 0)
# A tibble: 327,344 x 21
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <dbl>
 1  2013     1     1      517            515         2      510
 2  2013     1     1      533            529         4      530
 3  2013     1     1      542            540         2      563
 4  2013     1     1      544            545        -1      604
 5  2013     1     1      554            600        -6      492
 6  2013     1     1      554            558        -4      460
 7  2013     1     1      555            600        -5      553
 8  2013     1     1      557            600        -3      429
 9  2013     1     1      557            600        -3      518
10  2013     1     1      558            600        -2      473
# … with 327,334 more rows, and 14 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>,
#   (dep_time = dep_time%/%100 * 60 + dep_time%/%100)%%1440 <dbl>,
#   air_time_diff <dbl>
# check if air_time_diff = 0 

There are many flights in which there was a difference. This could be because some flights passed midnight, or crossed time zones.

Learn summarize(), group_by()

To find out mean delay for each month in each year:
flights %>% 
  group_by( year, month, day) %>% 
  summarize(delay = mean(dep_delay, na.rm = T))
# A tibble: 365 x 4
# Groups:   year, month [12]
    year month   day delay
   <int> <int> <int> <dbl>
 1  2013     1     1 11.5 
 2  2013     1     2 13.9 
 3  2013     1     3 11.0 
 4  2013     1     4  8.95
 5  2013     1     5  5.73
 6  2013     1     6  7.15
 7  2013     1     7  5.42
 8  2013     1     8  2.55
 9  2013     1     9  2.28
10  2013     1    10  2.84
# … with 355 more rows

To explore the relationship between distance and average delay for each location:

flights %>% 
  group_by(dest) %>% 
  summarize(count = n(),
            dist = round(mean(distance, na.rm = T), 2),
            delay = round(mean(arr_delay, na.rm = T), 2)) %>% 
  filter(count > 20, dest != "HNL") %>% 
  ggplot(aes(x = dist, y = delay)) +
  geom_point(aes(size = count, alpha = 0.4)) +
  geom_smooth(se = F) +
  theme_classic() +
  theme(legend.position = "none")

References:

https://jrnold.github.io/r4ds-exercise-solutions/ https://r4ds.had.co.nz/

Citation

For attribution, please cite this work as

lruolin (2021, April 28). pRactice corner: Data transformation. Retrieved from https://lruolin.github.io/myBlog/posts/20210424_Tidyverse Chap 1 - Data visualization/

BibTeX citation

@misc{lruolin2021data,
  author = {lruolin, },
  title = {pRactice corner: Data transformation},
  url = {https://lruolin.github.io/myBlog/posts/20210424_Tidyverse Chap 1 - Data visualization/},
  year = {2021}
}